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We have analyzed the dependence ol average ground state energy per monomer, e, ol the complex 
of two random heteropolymers with quenched sequences, on chain length, n, in the ensemble of 
chains with uniform distribution of primary sequences. Every chain monomer is randomly and 
independently chosen with the uniform probability distribution p = 1/c from a set of c different 
types A, B, C, D, .... Monomers of the first chain could form saturating reversible bonds with 
monomers of the second chain. The bonds between similar monomer types (like A-A, B-B, C-C, 
etc.) have the attraction energy u, while the bonds between different monomer types (like A-B, 
A-D, B-D, etc.) have the attraction energy v. The main attention is paid to the computation 
of the normalized free energy e(n) for intermediate chain lengths, n, and different ratios a = - 
at sufficiently low temperatures when the entropic contribution of the loop formation is negligible 
compared to direct energetic interactions between chain monomers and the partition function of the 
chains is dominated by the ground state. The performed analysis allows one to derive the force, 
/, which is necessary to apply for unzipping of two random heteropolymer chains of equal lengths 
whose ends are separated by the distance x, averaged over all equally distributed primary structures 
at low temperatures for fixed values a and c. 

PACS numbers: 02.50.-r, 05.40.-a, 87.10.-e, 87.15.Cc 

I. INTRODUCTION 

Recent progress in nanotcchnology has offered a possibility of single-molecular experiments. The corresponding 
technique allows one to investigate many physico-chemical and biological properties of individual molecules. One 
of the modern biophysical key experiments deals with the mechanical unzipping of individual double stranded DNA 
macromolecule under the action of external force applied to the ends of strands. This question has been analyzed 
theoretically in a number of important contributions [l|, 0, H, 0, [H, H, 0, B Hi ■ Some of them are devoted to the 
consideration of unzipping transition in an effective homopolymcr chain, the other pay attention to the heterogeneity 
of primary sequence of complimentary strands constituting the DNA molecule. 

In our work we address a problem of unzipping of a complex of two random heteropolymers of finite lengths at 
sufficiently low temperatures when the partition function is dominated by the ground state. We demonstrate that 
this problem can be mapped to the problem of alignment of two random sequences with the general "cost function" 
which takes into account the weights of perfect matches, mismatches and gaps (all necessary definitions are introduced 
below). Using this bijection we are able to compute the external work necessary to unzip the complex of two random 
heteropolymers, averaged over the uniform distribution of all possible primary sequences of heteropolymers. Our 
consideration allows also to conjecture the scaling corrections to the leading behavior of the force fluctuations due to 
the finiteness of the lengths of heteropolymer chains. 

The paper is organized as follows. In Section [TT] we define a model under consideration and introduce the basic 
notations. In Section IlIII we consider unzipping of two random heteropolymers from the point of view of the search 
1 of Longest Common Subsequence (LcS) of two random sequences. The expectation of the LCS energy is considered 
^ , in Section IIVI In Conclusion we give the qualitative explanation of our main results and derive a force, which is 
- - ■ necessary to apply to the chain ends to unzip two random heteropolymer chains at low temperatures. 



II. THE MODEL 



Consider two random heteropolymer chains of lengths L\ = ml and Li = n£ correspondingly. In what follows we 
shall measure the lengths of the chains in number of monomers, m and n, supposing that the size of an elementary 
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unit, i, is equal to 1. Every monomer can be randomly and independently chosen with the uniform probability 
distribution p = - from a set of c different types A, B, C, D, ... . Monomers of the first chain could form saturating 
reversible bonds with monomers of the second chain. The term "saturating" means that any monomer can form a 
bond with at most one monomer of the other chain. The bonds between similar types (like A- A, B-B, C-C, etc.) 
have the attraction energy u and are called below "matches", while the bonds between different types (like A-B, A-D, 
B-D, etc.) have the attraction energy v and are called "mismatches". Some parts of the chains could form loops 
hence contributing to the entropic part of the free energy of the system. Schematically a particular configuration of 
the system under consideration for c = 2 is shown in Fig[TJ 




Figure 1: Schematic picture of a complex of two random heteropolymer chains. 



Our aim is to compute the free energy of the described model at sufficiently low temperatures when the entropic 
contribution of the loop formation is negligible compared to the energetic part of the direct interactions between chain 
monomers. 

Consider now the partition function of such a complex G m , n which is the sum over all possible arrangements of 
bonds. Since we are interested in the low-temperature behavior of G m ,m we neglect the entropic contribution of the 
loop weights which allows to write G m>n recursively in terms of the partition functions of individual chains g(n): 

m,n 

G m ,n = g(m)g(n) + V p itj G i - 1 , j - 1 g(m - i)g(n - j) 

i% (1) 

Gm,o = g{m); G ,„ = g{n); G 0: o = 1 

The meaning of the equation (JTJ is as follows. Starting from, say, the left ends of the chains shown in Fig[T]we find the 
first actually existing contact between the monomers i (of the first chain) and j (of the second chain) and sum over 
all possible arrangements of this first contact. The first term "1" in fl} means that we have not found any contact at 
all. The entries j3^j (1 < i < m, 1 < j < n) are the statistical weights of the bonds which are encoded in a contact 
map {/?}: 



u/t jj? monomers i and j match 
e v > 11 if monomers i and j do not match 



_ i jj = o • ii iiiuiiumcio o aim j hicillii 

Pm,n — \ _ _ o u/T •-£ -• i -• J a i-i- V • / 



For a system of two heteropolymer chains depicted in FigQ]the contact map {(3} is shown in Fig[2j 

Generally speaking, if one allows for loop formation within a single chain, the partition functions of individual 
chains g(n) satisfy, in turn, recurrent equation [ill . Il2l ? ] 

n— 1 n 

g(n) = 1 + J2 E PlMi ~ 1 - - S)\ .9(0) = 1 (3) 

i— 1 

where (3[ ^ are the constants of self-association, which are, similarly to P m ,ni random variables encoded by some contact 
map. However, for the sake of simplicity, we assume in this paper that there is no self-association in the systems 
under discussion, and thus 

p' = 0; g(n) = l (4) 



This simplification is, in fact, rather significant from the mathematical point of view since we replace the quadratic 
set of equations ((T|), © with the linear set |T]), However, we expect the intrachain association to be effectively 
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suppressed due to the finite flexibility of the single chains, and, therefore, the typical values of f3' to be much less then 
the typical values of /?. We expect thus the interchain association to give just some minor corrections to the results 
obtained below. 

The case of non-zero f3's was thoroughly investigated recently (l3l . [lil . Il5j in a set-up when the interactions in the 
system are predetermined instead of random and we refer the reader to these papers for more detail. 



AABAB B BAAB BAAABAAB 



weight e"' T 
I □ weight e vlT 



Figure 2: Contact map {/3} corresponding to the complex of two random heteropolymer chains shown in FigfJ] 



III. UNZIPPING OF TWO RANDOM HETEROPOLYMERS AND SEARCH OF LONGEST COMMON 

SUBSEQUENCE (LCS) OF TWO RANDOM SEQUENCES 

A. Heteropolymer ground state energy: local recursive construction 



The straightforward computation shows that the partition function G m ,n obeys the following exact local recursion 

Gm,n Ctfi.-l,n ~t~ G m ^ n — 1 -\- (0m,n 1) Cm— l,n— 1 (5) 

Note that if fii.j — 2 for all 1 < i < m and 1 < j < n, the recursion relation ([5]) generates the so-called Delannoy 
numbers UM. 



Represent now the partition function G„ i; „ in the following way 

Q _ e F m , n /T 



(G) 



where — F n>m has the sense of the free energy and T stands for the temperature of the complex of two heterogeneous 
chains of lengths m and n. Considering the T — > limit, we get 



F m>n = lim Tin ( e F ^^ T + e F ^^' T + (/3 m ,„ - 1) , 



,F m _ 1 ,„_ 1 /T 



(7) 



which can be regarded as the equation for the ground state energy of a chain. The expression (J7J) can be rewritten in 
a symbolic form 



where 



-^m,n — max [-Fm-l.Tii -^m,n— 1; ^?n— l,n— 1 "h T]m,n\ 

i] + = Tln(e"/ T — 1) in case of match 



Vrn,n = T\n(/3 m ^ n - 1) = 



rj = Tln(e"/ T — 1) in case of mismatch 
Taking ry + as the unit of the energy, we can rewrite ([8]) as follows 

Fm^ri max F m —\ n7 F m n — i, F m —\ n — \ -\- fjm,n 



(8) 



(9) 



(10) 
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where 

!1 in case of match 

rf~ . . (11) 

a = — - in case of mismatch 
77+ 

In the low-temperature limit the parameter a has simple expression in terms of coupling constants u and v: 

_ rf \n{e v / T - 1) 



= I < 12 > 



rj+ ln(e"/ T - 1) 

Finally, the initial conditions for F m ^ n transform due to the second of equations ([1]) into 

Fo, n = F n fi = i*b,o = (13) 



B. Matching with gaps: the cost function 



In Eqs.(|5|)- (fT3")) we can recognize the recursive algorithm [T3, EH for the determination of the length F„ hn of the 
Longest Common Subsequence (LCS) of two arbitrary sequences of lengths m and n. It is easy to sec that the search 
of F m .„ can be completed in polynomial time ~ 0(mn). 

Recall that the problem of finding the LCS in a pair of sequences drawn from alphabet of c letters is formulated 
as follows. Consider two sequences a = {ai, ct2, • • • , otm) (of length m) and j3 = {fti, fa, . . . , /3 n } (of length n). 
For example, let a and (3 be two random sequences of c = 4 base pairs A, C, G, T of a DNA molecule, e.g., 
a = {A, C, G, C, T, A, C} with m = 6 and j3 = {C, T, G, A, C} with n = 5. Any subsequence of a (or (3) is an ordered 
sublist of a (and of 0) entries which need not to be consecutive, e.g, it could be {C,G,T, C}, but not {T, G, C}. 
A common subsequence of two sequences a and j3 is a subsequence of both of them. For example, the subsequence 
{C,G,A,C} is a common subsequence of both a and (5. There are many possible common subsequences of a pair of 
initial sequences. The aim of the LCS problem is to find the lon gest of them. This problem and its variants have been 
widely studied in biology [l§ E3, HH 111 jComputer science 0, Si Hi, HI , probability theory [H, US HI, [U, M HH 
and more recently in statistical physics [18|, [33 . |33| . A particularly important application of the LCS problem is to 
quantify the closeness between two DNA sequences. In evolutionary biology, the genes responsible for building specific 
proteins evolve with time and by finding the LCS of similar genes in different species, one can learn what has been 
conserved in time. Also, when a new DNA molecule is sequenced in vitro, it is important to know whether it is really 
new or it is similar to already existing molecules. This is achieved quantitatively by measuring the LCS of the new 
molecule with other ones available from database. 

In the simplest version of the LCS problem only the number of perfect matches is taken into account, i.e. there 
is 110 difference between mismatches and gaps. One can, however, easily construct a generalized model where this 
difference comes into play. Let us introduce the general "cost function", 5, having a meaning of an energy (see, for 
example [13, H3] for details) 

S = A match + n N mis + 6 N gap (14) 

In Q14[) iV matc h, A^irus and iVg ap are correspondingly the numbers of matches, mismatches and gaps in a given pair of 
sequences — see Fig[3l and \x and S are respectively the energies of mismatches and gaps. Without the loss of generality 
the energy of matches can be always set to 1. Besides (fl"4")l we have an obvious conservation law 

n + m = 27V match + 27V mis + A^ gap (15) 

which allows one to exclude ^V gap from (fT4")) and rewrite this expression as follows: 

S = iVmatch + l*N mia + S(n + m- 2N match - 2iV mis ) = (1 - 25)A^ matc h + 26)N mis + const (16) 

In p6p the irrelevant constant S(n + m) can be dropped out. 

Now we can adopt (1 — 28) as a unit of energy. Finally we arrive at the following expression 



S = Amatch + 7A r mis 



(17) 
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where 

and 7 < 1 by definition. The interesting region is < 7 < 1, since otherwise there are no mismatches at all in the 
ground state (i.e., there is no difference between 7 = 0, which corresponds to simplest version of the LCS problem, 
and 7 < 0). 



J - match; 



AABABBBAABBAAABAAB 

J ■ .-'y' "\x * " m ' sma * c h; 

ABAABBABBBABABB f nan 

' — - — • I - gap 

Figure 3: Matches, mismatches and gaps in a pair of sequences corresponding to the configuration of two random heteropolymers 
shown in Figfj] 

It is known [34|, [3?| that the ground state energy 

5 max = max [A match + 7 iV mis ] (19) 

satisfies the recursion relation 

(^max _ mav omax cmax r^max , /■ /qon 

°rn,n ~ mdx °m-l,-n> °m,n-l; °OT-l,n-l ' Sm,n ) 

with 

1 in case of match ,„ „ 

Cm n = { . . (21) 

I 7 in case of mismatch 

Indeed, the ground state may correspond cither (i) to the last two monomers connected, then the ground state energy 
equals S™^\ n _ t + Cm,n, or (ii) to the unconnected end monomer of the fist (or second) chain, then the ground state 
energy is 5™^_ x (or S^J. 

Comparing Eqs (|20|) . (|2Tj) with Eqs. lfTO]) . (fTTj) one sees that they are identical up to the exchange of variables 7 <-> a. 
This establishes the analogy between initial heteropolymer problem formulated in (jTJ) ([2]) in the low-temperature 
limit (|10p and the standard matching problem with general cost function |T] 



For a pair of fixed sequences of lengths m and n, the cost function <S™ a * is just a number. In the stochastic version 
of the LCS problem one compares two random sequences drawn from alphabet of c letters and hence the cost function 
5™ ax is a random variable. We are interested in the computation of the expectation and the variance of 5™ ax for 
m — n 3> 1 and the interpretation of the obtained results for LCS in terms of initial problem of unzipping of two 
random heteropolymers. 



C. Bernoulli model for heteropolymers 



We should note that the variables rj m ^ n in ([5]) are not independent of each other. Actually, consider a simple example 
of two strings a = AB and (3 = AA. One has by definition: 77^1 = 77^2 = 1 and 772,1 = 0. The knowledge of these three 
variables is sufficient to predict that the last two letters do not match each other, i.e., 772.2 = 0. Thus, 7)2,2 can not 
take its value independently of 7)1,1, 771,2, V2, i- These residual correlations between the 77^ variables make the LCS 
problem very complicated. However for two random sequences drawn from the alphabet of c letters, the correlations 
between the ?),„,„ variables vanish for c — > 00. 

In our work we restrict ourselves with the so-called Bernoulli matching (BM) model [l8| (which is simpler but yet 
nontrivial variant of the original LCS problem) where one ignores the correlations between 77 m ,„ for all c. The cost 
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function i^Jf of the BM model satisfies the same recursion relation ([8]) except that the fj mtn 's are now independent 
variables, each drawn from the bimodal distribution: 

{a with probability P(fj) = 1 — - 
, C (22) 
1 with probability P(fj) = ^ 

As it has been already said, this approximation is expected to be exact only in the appropriately taken c — *■ oo limit. 
Nevertheless, for finite c, the results on the BM model can serve as a useful benchmark for original LCS model to 
decide if indeed the correlations between fj m ^ n are important or not. 

Note that the problem under discussion can be redefined as follows. Consider a matrix fj of size m x n and let the 
elements of this matrix be independent random variables with bimodal distribution <\'2'2\i. Consider now all directed 
paths in this matrix, i.e. ordered sequences {(mi, ni); (1712, n.2); (nik, Jifc)} such that m, > mj_i and i%i > ri£_i for 
i = 2, k. Calculating the ground state energy of the matching problem is obviously equivalent to maximizing the 
sum of the matrix elements along these directed trajectories: 

k 

E m ,n{a) = max J~]f)mi,n t (23) 

all sequences z — * 
i=0 

In Fig[4]we show an example of the evolution of the optimal path with the increase of a for some particular random 
distribution of weights "a" and "1" (shown by white and grey squares respectively) corresponding to c = 4. 
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Figure 4: An example of a random distribution of "l"s (gray squares) and "a"s (white squares) on a 20 x 20 matrix with 
c = 4. The optimal path for a — is shown by the thick line, the diagonal optimal path for a = 1 - by the dashed line and the 
evolution of the optimal path with increase of a - by thin line. The "l"s and "a"s lying on the optimal paths are additionally 
marked by filled and open circles, respectively. See the main text for more details. 

The optimal path for small a is drawn in bold in Fig(5] With the increase of a, the first change in the optimal 
path configuration happens at a = ^ when a shortcut I (shown by a thin line) is formed instead of the corresponding 
section of the bold line. Then, at a — \ the shortcut marked by II actuates, then at a = | the one marked by III 
comes into play. So, for a > | the optimal path is III-I-II. In what follows we call this kind of path subdiogonal, 
meaning that it goes only through the diagonal of the matrix (a* j for i = 1, n) and one of its subdiagonals (ea^i+i, 
or ttj+i,, for i — 1, ...,n— 1). Finally, at a = | the subdiagonal path III-I-II ceases to be the optimal one, and optimal 
path sticks to the diagonal (dashed line) where it stays up to a = 1. 
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IV. EXPECTATIONS OF LCS ENERGY FOR GENERAL COST FUNCTION S 



In this Section we consider the dependence of the ground state energy on the parameter a defined in Eqs. ([2"Tj) -(fT2 |) . 
We start with the consideration of the limiting cases: (i) a < 1 and (ii) e = 1 — a <C 1 and then, with the physical 
insight in hands, proceed to the semi-quantitative consideration of the general case. 



A. The case < a = - < 1 



In the limit a = 0, as we have mentioned before, the problem under consideration corresponds exactly to the 
simplest version of the Longest Increasing Subsequence (LCS) problem, where the mismatches have no cost at all. 
The Bernoulli Matching model for this problem has been considered in details in [35| . An example of the random 
matrix with the optimal path is outlined by the bold line in Figf5] (only filled circles, i.e. points with the weight equal 
to 1 are relevant in this case). We know that the ground state energy, E m ,n, as a function of the chain lengths m,n 
behaves asymptotically for large m and n as 



iJpmn — pOm + n) (pmn) 1 ^ 6 
t (c,a = 0) = — — 



q q 



(1 +p) — J^—(m + n) 



2/3 

X (24) 



where p = cT 1 , q = 1 — p and \ is a random variable with the Tracy- Widom distribution [361 ]. The ground state 
energy, E, n , n (a — 0), has a meaning of the LIS length of "1" (see [Hj|). The mean value (E m , n ) in the thermodynamic 
limit n — m — > oo equals to 

(E m>n ) ee (E n . n ) = = —^n (25) 

q l + v c 

Consider now the case of finite a = — paying special attention to the effects of finite values of m, n on typical 
fluctuations of E. We assume below m = n for simplicity. 

If the value of a is small but finite (0 < a = — <S 1, the meaning of "small" is specified below), then the trajectory 
of the optimal matching path does not change with respect to the case of a = 0. The only difference from the a = 
case is that there are mismatches inserted between the matches whenever it is possible (see open circles along the bold 
line in FigU]). It is not difficult to estimate the number of such inserted mismatches. Namely, the typical distance 
(d) between the consequent " 1" (i.e. gray squares) along the optimal path in Figf4] projected to the horizontal and 
vertical axes is, correspondingly (m^+i — and (n^+i — Hi) . The value of (d) is dictated by the density of black 
circles along the optimal path (see fig.FigU]). For m = n — > oo one has 

(d) = {m i+ i - mi) = (m+i - m) = " = 1 + (26) 

The average energy gain due to a's (i.e. white squares in Figf?]) inserted into the optimal path can be estimated as 
follows 

(AE) = (E n . n ) ( (min[m i+ i - m. h n i+1 - rii]) -lja (27) 

Indeed, we can insert a white square into the optimal path between consequent gray squares if and only if the distance 
between these consequent gray squares in each of the dimensions is bigger or equal than two (we measure the distance 
in elementary squares). Let us estimate {AE) from above and from below. 

1. The upper bound corresponds to the assumption that the increments of m and n are fully correlated. In this case 
(min[mj + i — mj, ni+i — rij]} = (d) with (d) computed in (|26[) . Therefore, for (AE) we obtain the following estimate 



(AE) < (E n , n ) ( (d) - 1 )o = ( 1 - ^-^= ) na (28) 



2. The construction of the lower bound corresponds to the assumption that the increments of m and n are completely 
independent. The computations in this case are slightly more involved since we have to compute explicitly the average 
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value of the minimum d mul of two independent increments m and n. The computations presented in the Appendix lAl 
lead us to the following lower bound of (AS): 

(AE) > (E n>n ) ( (d min ) - l) a = (±±j£ ^-^j na (29) 

Collecting (|28|) and (|29|) we arrive at the following bilateral estimate of (AE) for 0<aCl: 

1 + 2 \ (AE 1 ) / 2 \ 
- = = a < < 1 -= a (30) 

It is worthwhile to notice in advance that, according to the numerical simulations, the genuine values of (AE) jn 
are actually very close to the lower bound (|29[) . 



B. The case a=l-e(0<e<l) 

Turn now to the opposite situation, a = 1 — e (0 < e <C 1). For e = the situation is trivial. Indeed, there is no 
difference between "l"s and "a"s (i.e., gray and white squares at FigUare identical) and the optimal path is thus the 
diagonal one with the energy 

E(m, to) = min[m, to]; E(n,n) = n (31) 

Now, for small but finite e and not too long trajectories, n (the definition of "not too long" is, once again, to be 
given below), the longest possible path still sticks to the main diagonal (see Figd]). This path is optimal with the 
ground state energy given by 

£ r f ag (a) =7i -fee (32) 
where k is the number of a's on the diagonal, which is a random variable distributed with the binomial law 

(recall that q = 1 — - and p = i). Hence the average energy (i?^ lag ) per monomer on the diagonal path equals 

l(E^) = l- { -^e=l-(l-p)e (34) 
n n 

Let us estimate now the length, n^, on which the optimal path detaches from the main diagonal. The optimal path 
of length n is separated from each of the suboptimal ones (i.e., those of length n — 1) by the energy gap SE: 

5E=(n-ke)-(n-l- k'e) = l-eSk (35) 

where Sk = k — k' is the difference in the number of a's on the optimal (diagonal) path and on the best of the 
suboptimal paths of lengths n — 1 (see FigH|). The optimal path detaches from the diagonal when 5E < 0. Since Sk 
cannot exceed n — 1, the diagonal path is always optimal until 

l-e5k<0 => l-(n d -l)e<0 n d > e" 1 + 1 (36) 

where is the length of the optimal path which detaches from the diagonal at energy e. The inequality (|36|) gives 
rather crude lower bound for the value of n for which the detachment of the optimal path from the diagonal actually 
happens. To acquire better bounds we should take into account the concurrent effects involved. On one hand, 
the single diagonal path has the advantage of being the longest one. The corresponding value of k has a binomial 
distribution ([33]) with the mean (k) = nq. On the other hand, the suboptimal paths (i.e., those of lengths n — 1) 
are disadvantageous because they are shorter, however their intrinsic advantage consists in high degeneracy: one has 
many such suboptimal trajectories. The number k! of a's on each particular suboptimal path is a binomial distributed 
random variable with the probability density W(k' , to — 1) and the mean (k') = (to — l)g. Now we have to find the best 
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(i.e. the minimal) value (k') among AT suboptimal paths. These suboptimal paths (there are AT ~ n 2 /2 of them) are, 
however, not independent. It is easy to understand that the number of independent suboptimal paths, A^ n( j, satisfies 
the following bilateral inequality: 

2 < N ind < 3n - 2 (37) 

Indeed, on one hand, there are at least 2 independent paths coinciding with upper and lower subdiagonals. On the 
other hand, by definition, the suboptimal paths can visit only these two subdiagonals and the main diagonal itself. 
The corresponding energetic costs are therefore always linear combinations of the values on the diagonal (n) and two 
subdiagonals ((n — 1)), that is, n + 2{n — 1) = 3n — 2 accessible matrix elements, which are themselves independent 
random variables. Evidently one cannot construct more than 3n — 2 independent linear combinations out of 3n — 2 
independent variables. We are, hence, to compute the average minimum of N- lnd independent random quantities each 
distributed with the probability density W(k',n — 1). This task is solved in Appendix IBl Taking into account the 
inequality (|37[) which defines the boundaries of A^ n( j, we can get the upper and lower estimates for (SkN ind ) (n^> 1), 
where (<5fcjv ind ) is defined as follows: 

(Sk Nind ) = (k) - (k' Ni J ee npq - (k' Ni J (38) 
Substituting into (|38|) the expressions derived in Appendix IBl for (fc^y. ), we have: 

q+ _L ( npg )l/2 < ( § k Nind ) <q+(2npq)^ 2 [in (in^^pq) 1 / 2 )] ^ (39) 

Remembering now that the optimal path detaches from the diagonal at (<5fc/v ind ) ~ e , and dropping out all constants 
of order of one, we arrive for n> 1 at the following approximate bilateral estimate for the detachment length, 

n d < (e 2 pq) 1 < n d lnn d (40) 



In FigO we show the results of our computer simulation of the average energy of the optimal path as a function 
of the sequence length for different values of a and p. One notes the crossover (for fixed a and p) from the path 
sticking to the diagonal at low n and the high-n regime, where the optimal path is detached. For n> 1 the average 
energy of the path eventually saturates at some value E^, which is a- and p- dependent. Moreover, though the 
detachment point is not exactly well-defined, the rescaling according to the inequality (|40p shows that it gives rather 
decent estimate of the detachment point. Note also that the plateau region persists up to quite large values of e. 
Indeed, it is easy to see from ([36]) that the detachment happens at rid > 2 (and thus a plateau of at least two points 
exists) for any a > ad = 1/2. It is less obvious and more important, however, that the more accurate estimate (|40|) is 
still relevant in the whole range of a £ (1/2, 1). 



/ 



E 

0.6- 



E 

0.6- 



e 2 np(1-p) 
(a) 



01 01 



s 2 np(1-p) In n 
(b) 



e 2 np(1-p) 
(c) 



Figure 5: The dependence of the reduced mean energy E — ((e n ) — (eo))/((eoo) — (eo)) of the optimal path on the reduced size 
n of the system, (a) for c = 4 and e = 0.3 (black squares), e = 0.2 (red circles), and e = 0.1 (blue triangles); (b) and (c) for 
e = 0.2 and c = 2 (black squares), c = 8 (green circles), and c = 32 (magenta triangles). Note that curves for c = 2,8 almost 
collapse after rescaling prescribed by the r.h.s of (|40j) . while those for c = 8, 32 collapse with rescaling prescribed by the l.h.s. 
of Igbjl. 
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C. The general case a £ [0, 1]: energy cost and fluctuations. 



Consider now the general case of a £ [0,1]. In Fig [6] we present the estimates of the average ground state en- 
ergy (e n (c, a)) = (E n (c,a)) /n for different values of c and a. These estimates we obtain by the finite size scaling 
extrapolating (e n (c, a)) from large, but finite, n to n — > oo. 



1.0 
08 



* . * 



0.5 

(a) 




(b) 



(c) 



Figure 6: (a) The limiting value of the ground state energy per cite (e n (a)) = (E(a)) /n as a function of a for different c: 
c — 2 (squares), c = 4 (circles), c = 8 (up triangles), c = 16 (diamonds), c = 32 (down triangles); (b) The upper (dashed 
line) and lower (thin solid lines) bounds and the hyperbolic fit (thick line) of the (e n (a) = (E(a)) /n) dependence for c = 4 



(circles) and c = 16 (diamonds); (c) The examples of the ground state energy per cite (e n ) as a function of : 



-2/3 



(thick line) 



and the finite-size scaling fits used to obtain points in the figure a) (thin lines) for several different values of a and c, line 1: 
a = 0.2, c = 4, line 2: a = 0.6, c = 8, line 3: a = 0.7, c = 16. 



In our construction we use the following conjecture. One sees from 



that at a = and for m — n 3> 1 the 



average ground state energy, (e n (c, a = 0)), converges to its value at infinity, (e^c, a = 0)) 
exponent a = —2/3: 



, with the scaling 



(e„(c, a = 0)) = - (E n , n (c, a = 0)) = — ^= + 1} <x) ^ 2/3 = 

n 1 + V c v c + 1 



,( Cj a = 0)>+/(c) (x) n c 



(41) 



where (x) = -1.7711... (see (U). 

We assume that the critical exponent a is a-independent and the finite size scaling of (e„(c, a)) for a > and n>l 
reads (see also H3) 



(e„(c, a)) = (e^ic, a)) + g(c, a) (x) n a 



(42) 



where g(c,a) is some function of c and a, but not of n. Extrapolating the data of (e ra (c, a)) computed numerically 
for large finite n to (e^c, a)) on the basis of finite size scaling (|42[) . we arrive at the family of curves (eoo(c, a)) for 
c = 2, 4, 8, 16, 32, 64 shown in Fig[6^,b. The results presented in Figj6fc, as well as those of j37[ demonstrate that the 
conjecture (|42[) is actually plausible. Apart from the points obtained by numerical simulation, in FigO) we depict: a) 
the estimates for (e^c, a)) at small a given by the inequality (|3U)) . and b) the estimates of (e^c, a)} on the plateau 
for 1 (Eq.([3l|l). 

One should note that the numerical results for a <C 1 are very close to the lower bound of ([50)) . We use this fact to 
produce a fit for the dependence (e^c, a)) in the whole range of parameter a £ [0, 1] for few values of c (c = 4, 16, 64). 
Namely, we fit the data of (e 00(a)) by a hyperbola of general form 



((eoo(a)) + K\a + Si) ((e 00(a)) + n 2 a + 5 2 ) = R 



(43) 



with the constraints that this hyperbola passes through the points (a, eoo(a)) = (0, 2/( v / c + 1)) at a = 0, and 
(a, eoo(a)) = (1, 1) at a = 1 with the slopes given by limiting linear approximations ([50]) and (p£4"| correspondingly 
These four constraints leave us effectively with only one free parameter, which we change to arrive at the best fit of 
the experimental data. As one sees from FiglBJa, the found fits for different values of c are quite good. 

Let us now discuss briefly the fluctuations of the average free energy and their dependence on n. One expects for 
n^$> 1 the average fluctuations a\ to be proportional to n 2 / 3 , typical for the Kardar-Parisi-Zhang universality class 
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[371 ] . This conjecture is consistent with the computation of the fluctuations of the averaged length of the Longest 
Common Subsequence (LCS) in the a = limit for Bernoulli Matching model (see [HI): 

o%(n) = Var£„,„(c) = « n (c)) - (^,«(c)) 2 « (<X 2 ) - <x) 2 ) / 2 (c) ^ (44) 

where O = 2/3 and (x 2 ) - (x) 2 = 0.8132 .... 

The behavior for intermediate values of n is more involved. In particular, for small a and intermediate n one expects 
for a E (n) the growth with the critical exponent 6i : 

<4(n)~n* (45) 

The exponent #i is known to be typical for the "transitional" regime in the (1+1)D KPZ equation [H, H^]. In terms 
of the work [3§| the exponent 9\, which governs the short-time behavior of the correlation function of KPZ model, is 
01 = [d + 4)/z — 2, where z is the dynamic exponent [3^ ]. and d is the space dimensionality. In d = 1 the value of z 
for KPZ model is known exactly, z = 3/2, giving the value 6\ = 4/3. 

For a = 1 — e (e <C 1) the plateau regime for e„(c, a) exists at low n < (where is defined in (|40p ). The 
arguments of Section IIVBI allow us to expect the in this case the variance cr E (n) behaves as 

a 2 E (n)~n e2 (46) 

with the Gaussian exponent 02 = 1 since the plateau energy is just the sum of n independent random variables. 

The numerical results presented in Figdfor cr%(n) fully confirm the behaviors (|44|) . ([45]) and (|46]) . In the case of 
intermediate a shown in Figl?h the sequence of regimes, at least for large c is more reach: we first note the exponent 
$2 = 1 (plateau), then the exponent 6\ = 4/3 ("transitional" KPZ), and finally the exponent 6 = 2/3 (large scale 
KPZ). It looks like the growing plateau region continuously "swallows up" the finite-size KPZ region with the increase 
of a, and thus at e = 1 — a <^ 1 one sees only two regimes. 




e 2 np(1-p) e 2 np(1-p) e 2 np(1-p) 

(a) (b) (c) 

Figure 7: The dispersion oe of the ground state energy as a function of iV for different values of a and c. (a) a — 0.7, (b) 
a — 0.2, (c) a = 0.4. In all figures c = 2, 4, 8, 16, 32 in ascending order. 

The important question concerning the results presented above is how universal are they with respect to the change 
of the model settings. Indeed, the exponents #0,1,2 ar c obtained in the assumption of Bernoulli matching, i.e. assuming 
the contact matrix j3 doesn't have any correlations in it. As it was mentioned above, in reality it is not the case. 
Though wc assume that some of the exponents may be universal, and may even hold in the case when there are 
strong long-range correlations in the structure of the associating DNA strands, the situation is a priori very unclear, 
a thorough numerical investigation of this question would be, in our opinion, an essential contribution in the field of 
the KPZ-relatcd studies. 



V. CONCLUSION 



In this work we have analyzed the average normalized ground state energy, e, of the complex of two random 
hctcropolymers with quenched sequences as a function of chain length, n, in the ensemble of chains with uniform 
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distribution of primary structures. The main attention is paid to the behavior of the function e(n) at intermediate 
chain lengths and low temperatures. 

The dependence (e„) is shown in FigO Besides the formal estimates of the boundaries (|30| . ([34]) . and of the 
crossover length, (Eq.([lD|)). it seems to be desirable to acquire the qualitative understanding of the zipping energy 
(e n ) for different chain lengths and different values of a. 

One sees that the normalized energy (e„) for relatively long (n > rid) zipped chain configurations, is larger than 
the corresponding energy in a hairpin state for n < n&. The reason for this result is as follows. Longer chains could 
optimize their energy matching via loops creation while for short chains the penalty for loop formation is forbiddingly 
large. Hence the inequality (|40|) gives the criterium for characteristic scale length which separates two kinds of 
structure behavior: short chains form the hairpin configuration in which the monomers are forced to bond without 
any regard of their species, while long chains are capable of adjusting their spatial configurations by loop formation to 
obtain better matching. The crossover around is, thus, separating the small n region where the energy approaches 
the plateau value exponentially fast with decreasing n, and infinitely large region of increasing (e„) where it 
approaches its value at n — ► oo with the power low dependence (eoo) — (e n ) ~ n -2 / 3 . This behavior of (e„) depends 
only qualitatively (see (|40|) ) on the parameter a for sufficiently large a > ~ 0.5. 

The unzipping process of two random heteropolymer chains is schematically shown in Fig|8] The results of previous 
sections allow us to find the dependence of the force f(x) per chain monomer, on an average extension distance, x, 
between chain ends. If N is the total length of each heteropolymer chain, and n is the average current length of the 
heteropolymer complex measured from its common bottom end (see FigJSJ, then by construction, x = 2(N — n). For 
the sake of simplicity, we neglect here the fluctuations of the unzipped regions of the chain. 



x 




Figure 8: Unzipping of two random heteropolymers. 



The plot of the average force, /, per chain monomer on the average separation distance, x, is shown in Figj9] To 
be precise, /(x), is the force necessary to unzip two random heteropolymer chains whose ends are separated by the 
distance x averaged over all equally distributed primary structures at low temperatures for fixed value a = ^ and 
given number of letters in the alphabet, c. The function /(x) can be easily obtained from the dependence (e„) shown 
in FigH Namely, /(x) = ±(n (e n » at n = N - x/2. 

Qualitative explanation of this phenomenon repeats the above discussion of the ground state free energy (e„) . As 
it has been said already, the main attention in our work is paid to relatively small n, i.e. large average separation 
distances, x. (For discussions of the peculiarities of the force on the other bound, i.e. at x — > 0, see @.) When 
x approaches the contour length, 2N , the equilibrium unzipping force /(x) gradually decreases as const — ?i~ 2 / 3 = 
const — (N — x/2) -2 / 3 until N — x/2 ~ nd when the force drops further down to reach the limiting plateau value (|34|) 
where it saturates independently of further increase of x. 

Let us stress once more that the result obtained is valid only for values of /(x) averaged over the ensemble of 
realizations of different heteropolymer sequences: for any given heteropolymer sequence, the equilibrium force would 
be a highly fluctuating function of the distance x. In reality, moreover, the unzipping experiments are often set up in 
the fixed-force ensemble, instead of the fixed-distance one (see, for example, j4fj|.l4l|). i.e. the constant force is applied 
to the ends of the chain, and the dynamics of the unzipping under this constant force is studied. In such a setting 
the knowledge of the characteristic occupation times for the intermediate states allows to reconstruct the overall free 
energy landscape. We predict that after the averaging over many realizations of such an experiment with different 
primary structures, one expects the typical occupation times for almost unzipped intermediate states to be less than 
those for the almost zipped conformations (the particular difference depends on the applied force). Correspondingly, 
the life time of the intermediate states is gradually decreasing with the increase of x until saturating at N — x/2 ~ n^. 
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600 400 



n=N-x/2 



n=N-x/2 



(a) 



(b) 



Figure 9: Dependence of unzipping force, /, per chain monomer on average separation distance, x. (a) log-linear scale, (b) 
linear scale. 
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Appendix A: AVERAGE VALUE OF THE MINIMUM OF TWO INDEPENDENT INCREMENTS 

First of all we should make a conjecture about the distribution of intervals d m = TOj+i — m, and d n = 7ij+i — 7ij. It 
seems to be rather natural to suppose that the intervals d m ^ n have the exponential distribution, i.e. p(d m . n ) ~ e~ kdrn < n 
(one can easily check that at least the tails of this distribution arc indeed exponential). Normalizing p(d m . n ), we get 



-kd„ 



p(dn 



(e k - l)e 



-kd„ 



(Al) 



n n = l 



kd„ 



The mean values (d m ) and (d n ) are 



(dm) = (d„,) = ^ d mjn p(d n 
d mn =l 



e k - 1 



(A2) 



Now we are to find the averaged joined minimum (d m i n ) of two random variables d m and d n distributed with (|A1[) . 
To do that we proceed as follows. First of all find the discrete integral distribution function, F\(z), for each random 
distribution, p(d m ) and p(d n ): 



Fx(z) 



d m „=1 



p{d m ,n) = 1 - e 



-kz 



Following the general procedure, define now the joined discrete integral distribution function, ^2(^)7 

F 2 (z)^l-{l-F l {z)f = l-e- 2kz 



(A3) 



(A4) 



Taking the discrete derivative, pi(z) = Fq,(z) — F^(z — 1), we find the probability distribution, p%(z = d) for the 
minimum d m i n = min[mi+i — to;, Tij+i — ni\. The last step consists in taking average (<i m i n ) with respect to the joined 
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distribution function P2(d): 

oo 

(drain) = ^ Z P 2 ( Z ) = £ 2k _ \ ( A5 ) 



e 2k 



Collecting ([26]), (|A2J) and (jASJ), we get 



1 + a/c 



e k - 1 2 

g 2fe 

e 2& _ 2 = \^ mir 



and thus, resolving (|A6 



Substituting (|A7|) into (J27J) one obtains finally the estimate of AS from below: 



(A6) 



<**0 - (A7) 
4Vc 



( ^y^r - YTTc ' 



Appendix B: AUXILIARY CONSTRUCTION FOR ESTIMATION OF THE DETACHMENT LENGTH 



Assuming — ^> 1 one can replace the binomial distribution ([33)) with the Gaussian one and approximate W(k) as 
follows 



W(k,n) 



1 ( (fc-(fc)) 2 

exp 



y/2iTnpq 



2npq 



(Bl) 



where (fc) = ng. The distribution function for the variable k' W(k',n — 1) is completely similar but the replacement 
n — > n — 1. 

We are now to compute the mean minimal value (k' N . ) of iV; n( j random variables, each distributed with W(k' , n — 
1) = ^(fc'). Repeating the same procedure as in the Appendix [Aj we proceed as follows. First of all pass to the 
integral distribution function, F(z): 



F(z) 



W(k')dk' = - ( 1 + erf 



z — (n — l)q 
y/2(n-l)pq 



Now construct the new probability distribution function, Q(z), for the joint distribution, as follows: 



dz 



Q(z) = 1 - (1 - F(z)) N ™ = N ind F'(z) (1 - F(z)) N ^-' 



The desired mean minimal value (k' N . ) reads now 



zQ(z)dz 



(B2) 



(B3) 



(B4) 



Now, taking the estimate (|37|) into account one readily arrives to the lower bound for (k' N . ). Indeed, for N-yad = 2 

1/2 



Wind 



-i= y ( yv /2(n - 1) M + (n - l)g) e" 1 ' 2 (l - erf (y)) = (n - l)g - ((n - l)pg) ' 



(B5) 



For A^ n( i = 3n — 2 3> 1 (sec (|37|l) the integral (|B4[) cannot be computed analytically and therefore one needs to apply 
some approximative approach. We proceed as follows. The function F{z) has a sense of the area under the curve W(k) 
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in the interval k' e (—00, z]. Consider now iV mc i 3> 1 independent random variables each distributed with W(k'). For 
—, ^" < -1 on average one point of Mnd equally distributed random points lies in the area F(z) ~ N,~l. Since 

y/(n-l)pq 

this area is the area under the left tail of the distribution W{k'), the point inside this area is the minimal one by 
construction. So, expanding F(z) for z ~("~ 1 ) 1 ? 



F(z) 



1 



1 + erf 



y/(n-l)pq 

z — (n — l)q 
y/2(n-l)pq 



<C — 1, we get 



y/(n - l)pq 



■ exp 



(z-(n-l)q)' 
2(n - l)pq 



1 



AT, 



ind 



(B6) 



'2n((n - l)q - z) 

Since the term in the exponent in (|B6|) varies much faster than the pre-exponential term, we can roughly estimate 
j w ™ k ^ as follows 

1/2 



U max 
K N iad 



(n-l)g- [2(n-l)pq 



1/2 



In [N iad [(n-l)pq 



1/2 



(B7) 



Note that Eq. (|B7p is obtained from Eq. (|B6j) under the condition z < (n — l)q which fixes the right sign of the square 
root branch of the second term in Eq. (|B7j) . 



Substituting iVi n d = 3n — 2 into (|B7|) and taking into account that n 1, we get the following desired estimate for 



max 

Wind 



(4™ x ) ~ (n - l)q - {2npq) 1 ' 2 [in (n^ 2 (pq) 1 ' 2 ) 



1/2 



(B8) 



Now we can use the boundaries (|B5|) and (|B8[) for getting lower and upper bounds of (<5fcjv ind ) and of the detachment 
length, n d - see Eqs. (p9)) -(|40l). 
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